ABSTRACT

Although item bias statistics are widely recommended
for use in test development and test analysis work, problems arise in
their interpretation. The purpose of the present research was to
evaluate the validity of logistic test models and computer simulation
methods for providing a frame of reference for item bias statistic
interpretations. Specifically, the intent was to produce simulated
sampling distributions of item bias statistics under the hypothesis
of no bias for use in determining cut-off points to provide
guidelines for interpreting item bias statistics obtained with actual
test data. The test data used were the item scores of 207 white and
730 black Cleveland (Ohio) ninth graders on the 75 items of the
1985 Cleveland Reading Competency Test. The area, root mean squared
difference, and Mantel-Haenszel methods were used to statistically
analyze the data. The results support the basic data simulation
approach used in this study. Real and simulated distributions for
three item bias statistics when bias was not present were very
similar, and the minor differences that were found between the
distributions had little effect on the interpretations of item bias
statistics obtained with actual test data. Seven steps for applying
the method of computer-simulated baseline statistics in test
development settings are outlined. One data table and four graphs
conclude the document. (Author/SLD)






Evaluation of Computer Simulated Baseline Statistics 
for Use in Item Bias Studies 



H. Jane Rogers and Ronald K. Hambleton 
University of Massachusetts at Amherst 






5/30/88 



Evaluation of Computer Simulated Baseline Statistics
for Use in Item Bias Studies

H. Jane Rogers and Ronald K. Hambleton
University of Massachusetts at Amherst

Abstract 

Though item bias statistics are widely recommended for use in test
development and test analysis work, problems arise in their
interpretation. The purpose of the present research was to evaluate
the validity of logistic test models and computer simulation methods
for providing a frame of reference for item bias statistic
interpretations. Specifically, the intent was to produce simulated
sampling distributions of item bias statistics under the hypothesis of
no bias for use in determining cut-off points to provide guidelines for
interpreting item bias statistics obtained with actual test data.

The results provided support for the basic data simulation
approach used in the study. Real and simulated distributions for three
item bias statistics when bias was not present were very similar, and
the minor differences that were found between the distributions had
little effect on the interpretations of item bias statistics obtained
with actual test data. Seven steps for applying the method of
computer-simulated baseline statistics in test development settings
were outlined in the paper.

The great public concern in this country over unfairness or bias
in testing has resulted in substantial numbers of research studies that
have described and evaluated new methods for identifying potentially
biased test items (Berk, 1982; Shepard, Camilli, and Averill, 1981;
Shepard, Camilli, and Williams, 1985). Most of the new methods based
upon use of item response models and related procedures involve the
calculation of statistics which are unfamiliar to test developers
(e.g., weighted b value differences, area between two item
characteristic curves, sum of squared differences between two item
characteristic curves).

One problem that has arisen in test development work concerns the
interpretation of these new item bias statistics. Certainly the
statistics, whatever their interpretation, can be used to rank-order
test items to identify the items of most and least concern. As test
developers often want to sort test items into ordered categories (e.g.,
"must be very carefully reviewed", "may need revision", "should be
acceptable"), critical values or cut-off points for classifying the
item bias statistics would be useful. The advantage of a
classificatory approach, as opposed to an approach based upon item
rankings, is that the number of potentially biased items does not need
to be specified in advance of the analysis. Thus the number of items
identified as potentially biased would depend on the dataset. Of course,
the main difficulty in placing items into categories is determining a
frame of reference and subsequently cut-off scores for interpreting the
IRT item bias statistics of interest.

The main purpose of the present research was to evaluate the
validity of logistic test models and computer simulation methods for
generating sampling distributions of item bias statistics under the
hypothesis of no item bias. These distributions are intended for use
in setting cut-off points to provide baselines for interpreting item
bias statistics. A secondary purpose was to highlight the use of the
methods in an item bias study.

This study was prompted by some earlier research by Hambleton,
Rogers, and Arrasmith (1986). These authors carried out a similar
study obtaining baselines from the analysis of real data provided by
two randomly equivalent majority samples and by two randomly equivalent
minority samples. Although meaningful baseline results are available
by conducting item bias studies on randomly equivalent samples, the
disadvantage of this approach is that the important comparisons between
the majority and minority groups are carried out with samples half
the size of those that were actually available.

Reduction of sample sizes by 50% to obtain baseline information is 
a high price to pay when initial sample sizes are often not very large. 
Small sample item bias studies are especially problematic when IRT
methods are used (Hoover and Kolen, 1984). Hambleton et al. (1986) 
also showed that logistic models could be used to provide simulated 
results to serve as a baseline for interpreting item bias statistics. 
It was clear, however, that more research was needed to strengthen 
their conclusion. 

Another way that item bias baseline statistics might be compiled 
is by combining the majority and minority groups of interest and then 
by conducting an item bias investigation using two randomly equivalent
samples drawn from the combined sample (Shepard, Camilli, and Williams,
1984; Wilson-Burt, Fitzmartin, and Skaggs, 1986). As item bias should
not be present in two randomly equivalent groups, the distribution of
item bias statistics obtained in two randomly equivalent groups could
serve as a basis for setting cut-off scores for interpreting item bias
statistics in the majority and minority samples.

The main shortcoming of this approach, a shortcoming of the
early Hambleton et al. (1986) work too, is that any difference in
the ability distributions between the majority and minority groups is
not reflected in the two randomly equivalent samples used to obtain the
baseline statistics. As group ability distributions can influence the
quality of item bias statistics (e.g., Shepard et al., 1984;
Wilson-Burt et al., 1986), failure to incorporate this information in
the analysis could reduce the usefulness of the distribution of item
bias statistics derived from the two randomly equivalent samples. One
solution that is sometimes applied when the majority group is large
involves selecting an examinee sample from the majority group to
approximate the distribution of scores in the minority group (e.g.,
Shepard et al., 1984). On the other hand, such ability differences and 
other unique features of the majority and minority samples can be 
incorporated into a computer-simulated item bias analysis regardless of 
the available sample sizes. For this reason, the current research 
centered on the potential value of computer-simulation techniques for 
providing the desired baseline distributions. 

Method

Choice of Item Bias Statistics

Three popular item bias statistics were chosen for the investigation:
the area method, the root mean squared difference method, and the
Mantel-Haenszel method. The choice of statistics was not of paramount
importance to the study, as the purpose was to investigate the
usefulness of simulated baseline distributions for the statistics
rather than the value of the statistics themselves. The methodology to
be proposed could be used with any of the other popular bias 
statistics, such as the pseudo-IRT, the full chi-square, or the 
residualized delta, for which the distributional properties are either 
known only approximately or not at all. Although the Mantel-Haenszel 
statistic does have a known distribution theoretically, it was included 
in the study because of the current interest in it, and because of 
quickness and ease of calculation. 

Area Method. In the Area Method, or Total Area Method as it is
sometimes called, the area between item characteristic curves for the
same item obtained in the majority and minority groups over a specified
interval on the ability scale (-3 to +3, in this study) is used as an
estimate of item bias (Rudner, Getson, and Knight, 1980). An item is
labeled as "potentially biased" when the area between the two curves is
large.

Root Mean Squared Difference Method. In applying this method
(Linn, Levine, Hastings, and Wardrop, 1981), one calculates the squared
difference between the majority and minority item characteristic curves
at fixed intervals (usually .01). These squared differences are
calculated over the interval on the ability scale which is of interest.
Finally, an average of the squared differences is calculated and the
square root of the average is taken. Again, large-valued statistics
reflect substantial differences between item characteristic curves.
Consequently, items associated with large-valued statistics are labeled
as "potentially biased."
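
The two IRT-based statistics described above can be illustrated with a
minimal Python sketch. It is not the authors' program. The function names,
the scaling constant d (set to 1.0 here; 1.7 is often used), the use of an
unsigned area, and the simple grid approximation of the area are all
assumptions made for illustration; only the -3 to +3 interval and the .01
step come from the text.

```python
import math


def icc_3pl(theta, a, b, c, d=1.0):
    """Three-parameter logistic ICC; d is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def area_and_rmsd(params_majority, params_minority, lo=-3.0, hi=3.0, step=0.01):
    """Approximate the unsigned area and the root mean squared difference
    between two ICCs evaluated on an evenly spaced ability grid.

    params_majority / params_minority are (a, b, c) tuples (hypothetical layout).
    """
    n = int(round((hi - lo) / step))
    abs_sum = 0.0
    sq_sum = 0.0
    for k in range(n + 1):
        theta = lo + k * step
        diff = icc_3pl(theta, *params_majority) - icc_3pl(theta, *params_minority)
        abs_sum += abs(diff)
        sq_sum += diff * diff
    points = n + 1
    area = (abs_sum / points) * (hi - lo)   # mean absolute gap times interval width
    rmsd = math.sqrt(sq_sum / points)       # square root of the mean squared gap
    return area, rmsd
```

For example, area_and_rmsd((1.0, 0.0, 0.2), (1.0, 0.5, 0.2)) compares two
curves that differ only in difficulty; identical parameter sets give zero
for both statistics.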

Mantel-Haenszel Method. The Mantel-Haenszel method has generated
considerable interest among test developers in recent years because it
appears to provide a quick, cheap, and valid indicator of item bias
(Holland and Thayer, 1986). Unlike the other two methods, this method
does not involve the application of item response theory (IRT) models
and principles. In essence, the method first matches examinees on a
criterion variable, often the overall test score because of
convenience. The ratio of the odds for success of the majority and
minority group members is calculated in each score group of interest
(with n items, there are n+1 possible score groups). Each ratio is
weighted by the sample size in the score group and then the ratios for
the (up to) n+1 score groups are combined to obtain the Mantel-Haenszel
statistic. When the odds for success on an item in the majority and
minority groups among examinees of the same ability level are
substantially different, item bias is suspected. The advantage of this
method over the other two previously described ones is that there is an
associated statistical test with a known sampling distribution
(chi-square with one degree of freedom). Thus meaningful cutoff scores
can be established. This statistic was considered because of the
substantial interest in its use in item bias work.
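
The paper describes the Mantel-Haenszel computation only in words, so the
following Python sketch uses the textbook form of the common odds ratio and
the one-degree-of-freedom chi-square as an assumption about what was
computed. The function name, the (a, b, c, d) table layout, and the rule of
applying the 0.5 continuity correction only when the absolute deviation is
at least 0.5 are illustrative choices, not details taken from the study.

```python
def mantel_haenszel(tables):
    """Combine 2x2 tables across matched score groups.

    Each table is (a, b, c, d): majority correct, majority incorrect,
    minority correct, minority incorrect (hypothetical layout).
    Returns the common odds ratio and the chi-square statistic.
    """
    sum_a = sum_e = sum_var = num = den = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t < 2:
            continue  # a score group this small carries no information
        n_maj, n_min = a + b, c + d
        m_right, m_wrong = a + c, b + d
        sum_a += a
        sum_e += n_maj * m_right / t                      # expected count under no bias
        sum_var += n_maj * n_min * m_right * m_wrong / (t * t * (t - 1))
        num += a * d / t
        den += b * c / t
    delta = abs(sum_a - sum_e)
    correction = 0.5 if delta >= 0.5 else 0.0             # continuity correction
    chi2 = (delta - correction) ** 2 / sum_var if sum_var > 0 else 0.0
    odds_ratio = num / den if den > 0 else float("inf")
    return odds_ratio, chi2
```

With one table per total-score group, a chi-square well above the reference
value for one degree of freedom would mark the item for review.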

Description of the Test Data and Examinee Sample

The test data used in the study were the item scores of 937
Cleveland ninth-grade students on 75 items of the 1985 Cleveland
Reading Competency Test (Cleveland Public Schools, 1985). In the total
sample, 207 Whites and 730 Blacks were present, of whom 451 were males
and 486 were females. Because of the very small number of Whites in the
sample, only a sex bias study was completed.

Generation of Simulated Examinee Item Scores

Basically, the approach was to simulate examinee item score data
that reflected as closely as possible the actual examinee and item data
of interest without any item bias. Item parameter and ability
parameter estimates obtained from the combined group three-parameter
logistic model analysis were treated as "true values" and then a
simulated set of item scores for the 937 examinees was generated by
using the three-parameter logistic model (Hambleton and Rovinelli,
1973).

With known ability, θ, and model parameters for item i, denoted
a_i, b_i, c_i, the probability of the examinee answering the item
correctly was assumed to be given by the three-parameter logistic
model:

$$P_i(\theta) = c_i + (1 - c_i)\,[1 + e^{-a_i(\theta - b_i)}]^{-1} \qquad [1]$$

With P_i(θ) in hand, an item score, 0 or 1, was obtained by first
choosing a random number from a uniform distribution on the interval
[0, 1]. If the random number chosen was less than or equal to P_i(θ),
which happens P_i(θ) of the time, the examinee was scored 1; otherwise
the examinee was scored 0. This process was repeated for each of the
75 items for the first examinee using the item parameter estimates
obtained from the analysis of the 937 examinees on the 75 item test.
Then, the ability score for the second examinee was substituted for the
first examinee in Equation [1], and the process of generating a vector
of item scores was repeated. This process was continued until 937
vectors of item scores were generated.
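
The generation procedure just described can be sketched in a few lines of
Python. This is an illustrative sketch rather than the original program:
the function names, the (a, b, c) tuple layout, and the scaling constant d
(1.0 here) are assumptions, and Python's pseudo-random generator stands in
for whatever uniform random number source was actually used.

```python
import math
import random


def icc_3pl(theta, a, b, c, d=1.0):
    """Three-parameter logistic ICC; d is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def simulate_item_scores(abilities, items, rng):
    """Generate one 0/1 score vector per examinee from a common set of ICCs.

    abilities: ability estimates treated as true values.
    items: list of (a, b, c) estimates treated as true values.
    rng: a random.Random instance, so that a run can be reproduced.
    """
    scores = []
    for theta in abilities:
        row = []
        for a, b, c in items:
            p = icc_3pl(theta, a, b, c)
            # Score 1 when the uniform draw does not exceed P_i(theta).
            row.append(1 if rng.random() <= p else 0)
        scores.append(row)
    return scores
```

Because every examinee is scored from the same item parameters, any group
difference in the simulated data comes from the ability values alone, which
is what makes the result usable as a no-bias baseline.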

The final product was a complete set of item scores for the 937
examinees on the 75 items that were generated from the three-parameter
logistic model. The simulated item scores were generated to be
consistent with the item and ability parameter estimates obtained with
the real data, but without bias. There was no bias because male and
female item scores were generated from a common set of three-parameter
item characteristic curves. Any differences in ability scores between
the majority and minority groups were retained because the ability
estimates obtained from the analysis of the real data were used in the
simulations.

A parallel set of item bias analyses was carried out on the real
and simulated data. Differences in the distributions of item bias
statistics would arise if bias were present in the real data, as in all
other respects the datasets were equivalent, if one assumes, of
course, that the three-parameter logistic model provided an appropriate
fit to the real data. For this reason, the fit of the three-parameter
logistic model to the test data was checked carefully (Hambleton and
Rogers, in press; Hambleton and Swaminathan, 1985).



With the actual and simulated test data in hand, three sets of
analyses were completed. The first analysis was intended to evaluate
the merits of computer simulated baseline sampling distributions of
item bias statistics. This analysis involved the comparison of
distributions of item bias statistics obtained in randomly equivalent
groups (no bias present) using the real data and the simulated
data. In this study, the available samples (real and simulated) were
halved in the analyses to provide a basis for evaluating the merits of
the chosen simulation methods.

The second analysis was intended to address the comparative 
effects of employing siaulated rather than real sampling distributions 
in setting cut-off scores. This analysis involved (a) setting cut-off 
scores with both the real and siaulated sampling distributions of itea 
bias statistics obtained under the true hypothesis of no bias and (b) 
comparing the effect of the different cut-off scores on the number of
items labelled "potentially biased" in a sex bias study.

The third and final analysis was an application of the new method
in a male-female item bias study. In this analysis, the purpose was to
highlight how the method can work in practice.

The specific steps in the procedure were as follows:
1. The real dataset was split into four subgroups, two male and two
   female, denoted M1, M2, F1, and F2. The M1 and M2, and the F1
   and F2 subgroups were randomly equivalent. Subgroups were
   formed so that an item bias study in the two randomly
   equivalent male samples and in the two randomly equivalent
   female samples could be achieved. The distribution of these
   item bias statistics (no bias present) provided a basis for
   evaluating the distribution generated from the simulated test
   data. Next, the simulated test data were also divided into
   four subgroups: M1, M2, F1, and F2. In this way, item bias
   statistics in the M1 and F1 and in the M2 and F2 samples in
   the simulated data could be calculated for the purpose of
   producing a sampling distribution of each item bias statistic
   of interest under the hypothesis of no bias. Both M1 and F1
   and M2 and F2 comparisons were preferred to the corresponding
   M1 and M2 and the F1 and F2 comparisons because the former
   subgroups reflected any real ability differences in the male
   and female samples, whereas the latter subgroups did not.

2. Separate modified three-parameter model analyses of the M1,
   M2, F1, and F2 real and simulated data were carried out. The
   c parameter was fixed at a value of .20. Eight IRT analyses,
   in all, were completed. Ability estimates obtained from the
   combined group analysis were also fixed in these analyses.

3. After the necessary data rescalings, two of the item bias
   statistics of interest, Area and Root Mean Squared
   Difference, were calculated for the group comparisons listed
   below:

   Real Data
      M1 vs F1
      M2 vs F2 (this analysis served as a replication of the
         study with the M1 and F1 samples)
      M1 vs M2
      F1 vs F2
      M vs F

   Simulated Data
      M1 vs F1
      M2 vs F2
      M vs F

   The Mantel-Haenszel statistics were calculated using the item
   response data provided at step 1.

4. For each item bias statistic, the following distributions were
   obtained:

   Real Data

   a. The combined distribution of M1 vs M2 and F1 vs F2 item
      bias statistics. (This distribution served as the
      baseline for interpreting the real item bias statistics
      obtained from the M1 vs F1 and M2 vs F2 comparisons.)

   b. The distributions of the M1 vs F1 and of the M2 vs F2
      item bias statistics. (The M2 vs F2 comparison
      served as a replication of the M1 vs F1 comparison.)

   Simulated Data

   c. The combined distribution of M1 vs F1 and of M2 vs F2 item
      bias statistics. (This distribution served as the
      alternate baseline for interpreting the real item
      bias statistics obtained from the M1 vs F1 and M2
      vs F2 groups.) This distribution was compared to
      4(a) cited previously to assess the viability of the
      computer-generated sampling distributions of item
      bias statistics.

5. The distributions obtained in step 4 (except for the real M
   vs F comparison) were smoothed by the method of "weighted
   rolling averages" (Kendall and Stuart, 1968) to remove some of
   the minor irregularities in the distributions.

6. The cut-off score corresponding to the .05 level of
   significance for each distribution (real and simulated) generated
   under the hypothesis of no bias was determined.

7. The cut-off scores obtained at step 6 were applied to the real
   item bias statistics to compare their effects.
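
Steps 5 to 7 can be sketched as follows. The paper does not report the
weights used for the "weighted rolling averages", so the (1, 2, 1) weights
below are an assumption, as are the function names, the histogram bin
width, and the choice to report the upper edge of the bin in which the
smoothed cumulative proportion first reaches .95. The sketch also assumes
that the bias statistics are non-negative.

```python
def smooth_counts(counts, weights=(1, 2, 1)):
    """Weighted rolling average of histogram counts (weights are assumed)."""
    half = len(weights) // 2
    smoothed = []
    for i in range(len(counts)):
        num = den = 0.0
        for j, w in enumerate(weights):
            k = i + j - half
            if 0 <= k < len(counts):   # renormalise at the two ends
                num += w * counts[k]
                den += w
        smoothed.append(num / den)
    return smoothed


def cutoff_from_null(null_stats, bin_width, level=0.05):
    """Cut-off score at the (1 - level) point of the smoothed null distribution."""
    n_bins = int(max(null_stats) / bin_width) + 1
    counts = [0] * n_bins
    for s in null_stats:
        counts[int(s / bin_width)] += 1
    smoothed = smooth_counts(counts)
    total = sum(smoothed)
    cumulative = 0.0
    for i, value in enumerate(smoothed):
        cumulative += value
        if cumulative / total >= 1.0 - level:
            return (i + 1) * bin_width   # upper edge of the current bin
    return n_bins * bin_width


def flag_items(real_stats, cutoff):
    """Indices of items whose real bias statistic exceeds the cut-off."""
    return [i for i, s in enumerate(real_stats) if s > cutoff]
```

In use, null_stats would hold the statistics from a no-bias comparison
(real or simulated), and flag_items would be applied to the statistics from
a real male-female comparison.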

In a final phase of the research, the IRT computer simulation
method was used to provide a baseline distribution for interpreting
item bias statistics obtained in the full male and female samples.

Results 

Model-Data Fit

The results from this study would have been meaningless unless the
three-parameter logistic model had at least provided an adequate
accounting of the actual item score data. Fortunately, the model fit
the test data well. The average residual (actual performance minus
expected performance assuming model-data fit) was .01. This average was
based on 12 comparisons (at ability levels -2.75, -2.25, ..., 2.75) of
the observed and expected performance for each of the 75 items in the
test. Clearly, there was no overall bias in the fit of the item and
ability parameter estimates to the test data. The average residual
calculated at each of the same ability levels across the 75 items was
also very small. It exceeded a value of .05 at four ability levels,
-2.75, -2.25, -1.75, and 2.75, where the combined examinee sample was
only 71 (about 7.5% of the total sample). In sum, the goodness-of-fit
results indicated a close fit between the three-parameter logistic
model and the actual test data.
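
A residual analysis of the kind summarized above might be sketched as
follows. The paper does not spell out how examinees were grouped or how
expected performance was computed, so the interval half-width of .25
around each ability level, the use of the mean model probability of the
grouped examinees as the expected value, the scaling constant d, and the
function names are all assumptions.

```python
import math


def icc_3pl(theta, a, b, c, d=1.0):
    """Three-parameter logistic ICC; d is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def average_residuals(abilities, scores, items, centers, half_width=0.25):
    """Average (observed - expected) proportion correct at each ability level.

    scores[n][i] is the 0/1 score of examinee n on item i.
    Returns {ability level: residual averaged across items}.
    """
    result = {}
    for center in centers:
        members = [n for n, th in enumerate(abilities)
                   if abs(th - center) <= half_width]
        if not members:
            continue  # no examinees near this ability level
        total = 0.0
        for i, (a, b, c) in enumerate(items):
            observed = sum(scores[n][i] for n in members) / len(members)
            expected = sum(icc_3pl(abilities[n], a, b, c)
                           for n in members) / len(members)
            total += observed - expected
        result[center] = total / len(items)
    return result


# The 12 ability levels named in the text: -2.75, -2.25, ..., 2.75.
LEVELS = [-2.75 + 0.5 * k for k in range(12)]
```

Averaging the values returned for all levels gives an overall residual
comparable in spirit to the .01 figure reported in the text.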

Comparison of the Real and Simulated Null Distributions

Figures 1, 2, and 3 provide the smoothed distributions under the
hypothesis of no bias for the three item bias statistics with both real
and simulated data. The results were clear: There was very little
difference between the sampling distributions of the item bias
statistics generated with real and simulated data. The maximum
difference in the sampling distributions with real and simulated data
was 7.8%. Also, the largest differences were always observed in the
lower halves of the sampling distributions where the consequences of
differences between the distributions on the determination of cut-off
values were small.



Insert Figures 1, 2, and 3 about here


Effect of Choice of Sampling Distribution

Perhaps the best way to judge the effects of choosing the
simulated over the real distributions of item bias statistics under the
hypothesis of no bias is in terms of the practical consequences of
using the cut-off scores obtained from the two distributions. Table 1
provides the .05 cut-off score for the real and simulated distributions
for each item bias statistic under the hypothesis of no bias. These
cut-off scores corresponded to the 95th percentile of the distribution
of statistics in each case. These cut-off scores were then applied to
the M1 vs. F1 and to the M2 vs. F2 real item bias data.



Insert Table 1 about here 



Table 1 shows that there were differences in the values of cut-off
scores obtained with the real and simulated distributions. These
differences influenced the numbers of test items identified at the .05
level, though the influence of choice of distribution appeared to be
small. Across six comparisons, the average difference was three items.
In view of the close similarity in the distributions as reflected in
Figures 1, 2, and 3, it is likely that the differences reflected, to a
great extent, the instability in determining the 95th percentile
because of the very limited amounts of data in the tails of the
distributions. Smoothing the simulated distributions was helpful, but
basically the problem remained: there was a limited number of data
points in the tails of the distributions. In addition, some differences
in the results were expected because the simulated distributions
reflected the ability distribution differences in the male and female
samples to a greater degree than the real null distributions under the
hypothesis of no bias.

An Example

Though samples of (approximately) 450 males and females were
available for the research investigation, it was necessary to divide
each sample in half so that various comparisons of results could be
made to evaluate the merits of the computer simulation. In practice, a
test developer would carry out the item bias study with the full set of
available data. Figure 4 highlights the results of an item bias
investigation (using the Total Area Statistic) with the full male and
female samples, and the smoothed computer-simulated distribution of the
total area bias statistics without any bias. The .05 level of
significance was chosen to identify items in need of careful review.
The number of items identified was eight. Similar analyses were
carried out with the other two bias statistics of interest in this
study. Six items were identified with the Root Mean Squared Difference
method, while seven items were identified with the Mantel-Haenszel
method.



Insert Figure 4 about here 



Conclusions 

The main results of this study reported in Figures 1 to 3 provided
support for the use of simulated data to establish critical values for
IRT item bias statistics. When the test data fit the model chosen, use
of the IRT parameter estimates to generate data allows the test
developer to simulate samples closely resembling the original data but
under conditions of no bias. Though the results in Figures 1 to 3 do
not provide evidence of the importance of retaining ability differences
in the simulations of majority and minority group performance,
preserving these differences to enhance the validity of
the simulated sampling distributions seems desirable. Given the
practical limitations of IRT parameter estimation, particularly in
relatively small samples, retaining these ability distribution
differences may be important, as they may affect the IRT item bias
statistics. When randomly equivalent samples of the real data are used
to establish cutoff values for the bias statistics, this consideration
is not taken into account. Hence simulating the ability differences
under conditions of no bias probably allows the investigators to set
more realistic cutoff values for the bias statistics.

In the present study, taking ability distribution differences,
though slight, into account produced higher cutoff values with
IRT-based methods than were obtained from using random samples of the
real data. The result was the flagging of fewer items as biased.
Given that the groups were males and females, and that no substantial
bias was expected, the direction of the observed differences supports
the use of simulated data to establish cut-off points for the IRT item
bias statistics.

The lack of agreement observed between the two replications of the 
bias analysis in the real data (as revealed in Table 1) highlights the 
problem of using IRT methods in small samples. Substantially better 
results should be obtainable with larger sample sizes. But with small 
samples, researchers should be cautioned against using any firm cut-off 
score for the bias statistics. In the small sample case it is 
recommended that the simulated data baseline be used more to give a 
sense of what is extreme in the values of the bias statistics than to
label an item as potentially biased or not. Smoothing distributions
definitely reduced the problem of unstable cut-off points; using larger 
samples would be very helpful too. 

The results for the Mantel-Haenszel statistic suggest that
although data can be generated which will return IRT parameter
estimates similar to those obtained from the real data, it is more
difficult to generate response patterns that closely resemble the real
data. Hence, the method proposed in this paper of simulating data to
establish baseline values may not be useful for bias statistics that
are not derived from IRT models.

In summary, application of the IRT computer simulation method for
generating baseline distributions of item bias statistics is as
follows:

1. Choose an IRT model and estimate item and ability parameters
   for the total group of examinees. Assess model-data fit.
   Continue with the method if the model-data fit is acceptable.
   Otherwise choose a more general IRT model to fit the data
   better. Items which are suspected of being biased can be
   removed from the analysis at this step. Removal of items does
   not seem necessary unless the number of items suspected of
   being biased is a significant portion of the total number of
   items in the test (e.g., 10% or more).

2. Treat the item and ability parameter estimates as "true"
   values and generate a new set of examinee item scores by using
   the logistic model of choice in step 1 (e.g., Hambleton and
   Rovinelli, 1973).

3. Split the simulated examinee item scores into the majority and
   minority groups of interest and re-estimate the item
   parameters, while treating ability scores obtained at step 1
   as fixed. (Fixing the ability scores serves two purposes:
   (a) item parameter estimation time is reduced substantially
   and (b) scaling problems with the data are considerably
   reduced.)

4. Choose the IRT item bias statistic (or statistics) of interest
   and carry out the necessary calculations on the ICCs and
   ability estimates for the simulated majority and minority test
   data.

5. Produce the sampling distribution of the item bias statistics
   obtained from the simulated data and smooth the distribution
   of resulting item bias statistics to remove some of the
   instability in determining cut-off scores. Determine the
   cut-off value corresponding to the 95th percentile (and/or
   other cut-off values of interest).

6. Repeat steps 3 and 4 with the real test data.

7. Interpret the item bias statistics obtained with the real test
   data at step 4 by using the cut-off values obtained from the
   simulated test data at step 5.

Test developers who carry out these seven steps will be able to 
interpret their item bias statistics more meaningfully due to the 
availability of information about the distribution of the item bias 
statistics when no bias is present. 

Table 1

Effects of the Choice of Distribution (Real or Simulated) on the
Determination of Cut-off Scores and Identification
of Potentially Biased Test Items

                     Real Null Distribution   Simulated Null Distribution
Bias Statistic       Critical    Biased       Critical    Biased           Difference
                     Value¹      Items²       Value       Items

Area                 .544        4 (11)       .659        1 (6)            3 (5)

Root Mean Squared    .113        4 (10)       .134        3 (3)            1 (7)
  Difference

Mantel-Haenszel      3.42        6 (19)       3.03        6 (21)           0 (2)

¹ At the .05 level.

² The numbers in brackets correspond to the numbers of test items identified as potentially biased
in a replication of the study with the second male and female samples.

Figure Captions

Figure 1. A comparison of the simulated and real sampling distributions
of the Item Area Statistics under the hypothesis of no bias.

Figure 2. A comparison of the simulated and real sampling distributions
of the Item Root Mean Squared Difference statistics under the
hypothesis of no bias.

Figure 3. A comparison of the simulated and real sampling distributions
of the Item Mantel-Haenszel statistics under the hypothesis of
no bias.

Figure 4. A comparison of the distribution of Item Area Statistics for
the total male and female groups, and the smoothed
distribution of the same statistic for the total simulated
male and female groups under the hypothesis of no bias.